```
     height           weight            age             male       
 Min.   : 53.98   Min.   : 4.252   Min.   : 0.00   Min.   :0.0000  
 1st Qu.:125.09   1st Qu.:22.008   1st Qu.:12.00   1st Qu.:0.0000  
 Median :148.59   Median :40.058   Median :27.00   Median :0.0000  
 Mean   :138.26   Mean   :35.611   Mean   :29.34   Mean   :0.4724  
 3rd Qu.:157.48   3rd Qu.:47.209   3rd Qu.:43.00   3rd Qu.:1.0000  
 Max.   :179.07   Max.   :62.993   Max.   :88.00   Max.   :1.0000  
```
McElreath focused on the adult data (age >= 18) and fit a linear model to those data, but we will now use the entire data set. As a heads up, there are two lessons I think we need to learn, in addition to just getting practice fitting models to data and working with the posteriors.
Linear models are often useful phenomenological descriptions of the data, even when they are not good mechanistic explanations. Moreover, we can often shoehorn non-linear relationships into a linear form to good benefit.
Reformulating a model or problem into a mathematically equivalent form can have the benefits of facilitating model-fitting (i.e., removing some problems with estimation) and simplifying the interpretation of parameters.
We’ll get to both, but first let us fit a simple, linear model.
A simple model of weight by height
Let me propose this form and priors for a model of weight against height:
\[
\begin{align}
y_i & \sim \text{Normal}(\mu_i, \sigma) \\
\mu_i & = \alpha + \beta x_i \\
\alpha & \sim \text{Normal}(10,10) \\
\beta & \sim \text{Normal}(2,1) \\
\sigma & \sim \text{Exponential}(1)
\end{align}
\] where \(y_i\) is the weight of individual \(i\) and \(x_i\) is their height.
Simulating possible relationships from a linear model: Prior checks
Your first task is to simulate and then plot 100 possible relationships (lines) from this model, given both the structure and the priors I have provided. We’ll assume that heights vary from a low of 50 cm to a high of 180 cm. Also, please do this before looking at the data; that would be cheating! ;-)
```r
x <- 50:180  # heights

# parameters
a <- rnorm(n = 100, mean = 10, sd = 10)
b <- rnorm(n = 100, mean = 2, sd = 1)

# plot y by x for each of the 100 parameter draws
plot(NULL, xlim = c(50, 180), ylim = c(-50, 400),
     ylab = "weight (kg)", xlab = "height (cm)")
abline(h = 0)
for (i in 1:100) {
  lines(x = x, y = a[i] + b[i] * x, col = rgb(0, 0, 1, alpha = 1/5))
}
```
Having done so, do you have any concerns with my priors? If so, please tell me how you are changing them.
```r
# I thought the intercept and slope were too variable and extreme

# parameters
a <- rnorm(n = 100, mean = 0, sd = 5)
b <- rnorm(n = 100, mean = 0.25, sd = 1/4)

# plot y by x for each of the 100 parameter draws
plot(NULL, xlim = c(50, 180), ylim = c(-50, 400),
     ylab = "weight (kg)", xlab = "height (cm)")
abline(h = 0)
for (i in 1:100) {
  lines(x = x, y = a[i] + b[i] * x, col = rgb(0, 0, 1, alpha = 1/5))
}
```
Fitting a simple linear model with grid approximation
Now that you are happy with the priors, please fit this model to the actual !Kung data using a grid approximation. Be sure to at least examine the posterior of the parameters.
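In case you want a starting point, here is a minimal sketch of a grid approximation for this model. The grid ranges are guesses (widen them if the posterior piles up at an edge), and the simulated data at the top are just a stand-in so the sketch runs on its own; swap in the real df.

```r
# stand-in data; replace with the real !Kung data frame, df
set.seed(1)
n <- 200
df <- data.frame(height = runif(n, 60, 180))
df$weight <- -32 + 0.5 * df$height + rnorm(n, 0, 5)

# grid over the three parameters (ranges are a guess)
grid <- expand.grid(a = seq(-45, -20, length.out = 40),
                    b = seq(0.4, 0.6, length.out = 40),
                    sigma = seq(3, 7, length.out = 20))

# log-likelihood plus log-prior at each grid point
grid$lp <- sapply(1:nrow(grid), function(i) {
  mu <- grid$a[i] + grid$b[i] * df$height
  sum(dnorm(df$weight, mu, grid$sigma[i], log = TRUE)) +
    dnorm(grid$a[i], 0, 5, log = TRUE) +
    dnorm(grid$b[i], 0.25, 1/4, log = TRUE) +
    dexp(grid$sigma[i], 1, log = TRUE)
})

# convert to relative posterior probability and sample grid rows
prob <- exp(grid$lp - max(grid$lp))
rows <- sample(1:nrow(grid), size = 1e4, replace = TRUE, prob = prob)
sample.a <- grid$a[rows]
sample.b <- grid$b[rows]
sample.sigma <- grid$sigma[rows]
```

Subtracting `max(grid$lp)` before exponentiating avoids underflow; `sample()` then only needs relative probabilities.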
```r
plot(sample.a, sample.b, pch = 16, col = rgb(0, 0, 1, alpha = 1/10))
```
Fitting a simple linear model with quap()
Let’s admit it, grid approximation is clunky and slow. I, for one, keep finding that I haven’t considered a wide enough range of parameters, but also that my grid is too coarse. Plus, it’s quite easy to screw up the coding. The quadratic approximation that McElreath provided addresses some of these problems and gets us closer to using more modern, general-purpose tools. So let’s repeat our above analysis using quap().
```r
# quap version
m <- quap(
  alist(
    weight ~ dnorm(mu, sigma),
    mu <- a + b*height,
    a ~ dnorm(0, 5),
    b ~ dnorm(0.25, 1/4),
    sigma ~ dexp(1)
  ),
  data = df
)
precis(m)
```
```
            mean          sd       5.5%      94.5%
a     -32.195644 1.067846605 -33.902269 -30.489019
b       0.490836 0.007579786   0.478722   0.502950
sigma   4.970433 0.150200163   4.730384   5.210482
```
```r
samples <- extract.samples(m)
hist(samples$a)
hist(samples$b)
hist(samples$sigma)
plot(b ~ a, data = samples, pch = 16, col = rgb(0, 0, 1, alpha = 1/10))
```
Interpreting parameters
Examine the posterior of the key parameters (plots are helpful!) and tell me what you think they show.
–> the weight at zero height (??) is about -32 kg.
–> for every cm increase, the weight increases about 0.5 kg, assuming the linear model is good.
–> there is a good deal of variation around the expected line since the standard deviation is almost 5 kg.
If you have not yet examined the potential relationship between the parameters \(a\) and \(b\), do so here and tell me what you conclude.
–> There is a strong, negative correlation between slope and intercept. If the intercept is higher, the slope needs to be lower to “hit” the data, and vice versa. That is, these two parameters cannot be estimated independently. Knowing the value of one tells you about the value of the other.
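This intercept–slope correlation is a generic feature of regressing on an uncentered predictor, not something special about our data or priors. A quick self-contained demonstration, using simulated data as a stand-in for the !Kung sample and ordinary least squares (whose coefficient covariance shows the same structure as our quadratic approximation):

```r
# simulate height-weight-like data
set.seed(2)
x <- runif(300, 60, 180)
y <- -32 + 0.5 * x + rnorm(300, 0, 5)

# same model fit on the raw and the centered predictor
fit.raw  <- lm(y ~ x)
fit.cent <- lm(y ~ I(x - mean(x)))

# correlation between intercept and slope estimates
cov2cor(vcov(fit.raw))[1, 2]   # strongly negative
cov2cor(vcov(fit.cent))[1, 2]  # essentially zero
```

Centering makes the predictor orthogonal to the intercept column, which is exactly why the revision below works.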
Revising our model structure
Given the correlation we observed, let me propose a mathematically equivalent model structure: center your \(x\)-axis on the mean of those \(x\) values. That is, let \(hc_i = h_i - \bar{h}\) and use \(hc_i\) as your predictor variable.
```r
df$hc <- df$height - mean(df$height)
summary(df$hc)
```
```
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -84.29  -13.17   10.33    0.00   19.22   40.81 
```
You may need to reconsider your priors. Do so and then fit the model with quap(). (You are welcome to repeat the fit with grid-approximation, too, if you like, but it’s not necessary.)
```r
# examine priors
x <- -85:40

# parameters
a <- rnorm(n = 100, mean = 50, sd = 10)
b <- rnorm(n = 100, mean = 0.25, sd = 1/4)

# plot y by x for each of the 100 parameter draws
plot(NULL, xlim = c(-85, 40), ylim = c(-10, 100),
     xlab = "difference in height from mean (cm)", ylab = "weight (kg)")
abline(h = 0)
for (i in 1:100) {
  lines(x = x, y = a[i] + b[i] * x, col = rgb(0, 0, 1, alpha = 1/5))
}
```
```r
# quap version
m.cent <- quap(
  alist(
    weight ~ dnorm(mu, sigma),
    mu <- a + b*hc,
    a ~ dnorm(50, 10),
    b ~ dnorm(0.25, 0.25),
    sigma ~ dexp(1)
  ),
  data = df
)
precis(m.cent)
```
```
            mean          sd       5.5%      94.5%
a     35.6165000 0.212660807 35.2766270 35.9563731
b      0.5014572 0.007709588  0.4891358  0.5137786
sigma  4.9611748 0.149389260  4.7224219  5.1999277
```
```r
samples.cent <- extract.samples(m.cent)
plot(b ~ a, data = samples.cent, pch = 16, col = rgb(0, 0, 1, alpha = 1/10))
```
How did this change your results? In particular, what happened to the relationship between the parameters \(a\) and \(b\)? Why? Similarly, how did the interpretation of the parameter \(a\) change?
–> the slope and sigma are virtually unchanged, but the intercept is now much higher, as expected.
–> the intercept now tells us the expected weight for a person at the average height, whereas before it was the expected weight of a person at zero height.
–> the correlation between a and b is gone. A change in the intercept no longer requires a change in the slope to “hit” the data.
Comparing the model expectation to the data
So far we have not compared the model-expected relationship between weight and height with the actual relationship observed in the data. Let’s do so now.
```r
plot(weight ~ hc, data = df, type = "n")
for (i in 1:1000) {
  curve(samples.cent$a[i] + samples.cent$b[i] * x,
        col = rgb(0, 0, 1, alpha = 1/50), add = TRUE)
}
points(weight ~ hc, data = df)
```
```r
# Alternative using link function
newdat <- link(fit = m.cent, data = data.frame(hc = -85:40), n = 1000)
plot(weight ~ hc, data = df, type = "n")
for (i in 1:1000) {
  lines(x = -85:40, y = newdat[i, ], col = rgb(0, 0, 1, alpha = 1/50))
}
points(weight ~ hc, data = df)
```
How well does our model describe the general features of the data?
–> Poorly. There is a clear saturating curve in the data, but our expectation is a straight line. Thus, we underpredict weight for low and very high heights and overpredict for most of the heights in the middle.
An allometric linear model
Suppose a colleague of yours who works on allometry glances at the practice problems just above. Your colleague exclaims, “That’s silly. Everyone knows that it’s only the logarithm of body weight that scales with height!” Let’s take your colleague’s advice and see what happens.
First, revise the model to accommodate a linear relationship between height and the log of weight.
Second, be sure to examine the consequence of your priors, adjusting them as needed.
```r
x <- 53:180  # heights, spanning the observed range
xbar <- 138  # approximate mean height

# parameters
a <- rnorm(n = 100, mean = 2.5, sd = 1/10)
b <- rnorm(n = 100, mean = 1/50, sd = 1/200)

# plot y by x for each of the 100 parameter draws
plot(NULL, xlim = c(-85, 40), ylim = c(1, 4),
     xlab = "height (difference from mean; cm)", ylab = "weight log(kg)")
abline(h = 0)
for (i in 1:100) {
  lines(x = x - xbar, y = a[i] + b[i] * (x - xbar), col = rgb(0, 0, 1, alpha = 1/5))
}
```
Third, fit the model to the data. Let’s use quap() again.
```r
# quap version
xbar <- mean(df$height)
df$hc <- df$height - xbar
df$lwt <- log(df$weight)

m1 <- quap(
  alist(
    lwt ~ dnorm(mu, sigma),
    mu <- a + b*hc,
    a ~ dnorm(2.5, 1/10),
    b ~ dnorm(1/50, 1/200),
    sigma ~ dexp(1)
  ),
  data = df
)
precis(m1)
```
```
            mean           sd       5.5%      94.5%
a     3.44034496 0.0045910790 3.43300753 3.44768239
b     0.02050079 0.0001665079 0.02023468 0.02076691
sigma 0.10715771 0.0032484077 0.10196613 0.11234929
```
```r
samples <- extract.samples(m1)
```
Fourth, interpret the posterior of the parameters and interpret them as well as you can.
```r
hist(samples$a)
hist(samples$b)
hist(samples$sigma)
plot(a ~ b, data = samples, col = rgb(0, 0, 1, alpha = 1/20))
```
–> Again, a tight posterior for mean weight, around exp(3.44) ≈ 31 kg at the average height.
–> The slope of around 0.021 implies an exp(0.021) ≈ 1.021-fold increase in mass (kg) with every 1 cm increase in height.
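To make that slope more tangible, a couple of back-of-the-envelope transformations (plain arithmetic on the posterior means above):

```r
exp(3.44)        # expected weight at the mean height: ~31 kg
exp(0.021)       # multiplicative change in weight per 1 cm: ~1.021
exp(0.021 * 10)  # ~23% more weight per 10 cm of height
log(2) / 0.021   # height change for expected weight to double: ~33 cm
```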
Fifth, graph the relationship between weight and height predicted by this new model, as well as the actual observations.
```r
# linear on the log(weight) scale
plot(lwt ~ hc, data = df)
for (i in 1:1000) {
  curve(samples$a[i] + samples$b[i] * x, col = rgb(0, 0, 1, alpha = 1/50), add = TRUE)
}
curve(mean(samples$a) + mean(samples$b) * x, col = "red", add = TRUE)
```
```r
# Alternative using link function
hcs <- seq(-85, 42, length.out = 100)
newdat <- link(fit = m1, data = data.frame(hc = hcs), n = 1000)
plot(lwt ~ hc, data = df)
for (i in 1:1000) {
  lines(x = hcs, y = newdat[i, ], col = rgb(0, 0, 1, alpha = 1/50))
}
curve(mean(samples$a) + mean(samples$b) * x, col = "red", add = TRUE)
```
```r
# plot on the weight scale
plot(weight ~ height, data = df)
for (i in 1:1000) {
  lines(x = hcs + xbar, y = exp(newdat[i, ]), col = rgb(0, 0, 1, alpha = 1/50))
}
curve(exp(mean(samples$a) + mean(samples$b) * (x - xbar)), col = "red", add = TRUE)
```
Sixth, what can we conclude from this analysis?
–> it seems the colleague was right: log(weight) is linearly related to height (equivalently, weight is exponentially related to height). However, a close look at the line and the data suggests it’s not a perfect description of the data.
–> There is no mechanism in this model, or not really. But it does capture the general features of the data pretty well.